The capacity with which a system of independent neuron-like units represents a given set of stimuli is studied by calculating the mutual information between the stimuli and the neural responses. Both discrete noiseless and continuous noisy neurons are analyzed. In both cases, the information grows monotonically with the number of neurons considered. Under the assumption that neurons are independent, the mutual information rises linearly from zero, and approaches its maximum value exponentially. We find the dependence of the initial slope on the number of stimuli and on the sparseness of the representation.



I. INTRODUCTION 



Neural systems have the capacity, among others, to represent stimuli, objects and events in the outside world. 
Here, we use the word representation to refer to an association between a certain pattern of neural activity and some 
external correlate. Irrespective of the identity or the properties of the items to be represented, information theory
provides a framework where the capacity of a specific coding scheme can be quantified. How much information can
be extracted from the activity of a population of neurons about the identity of the item that is being represented at
any one moment? Such a problem, in fact, has already been studied experimentally [1-11].
Typically a discrete set of p stimuli is presented to a subject, while the activity of a population of N neurons is 
recorded. At its simplest, this activity can be described as an N dimensional vector r, whose components are the 
firing rates of individual neurons computed over a predefined time window. The measured response is expected to 
be selective, at least to some degree, to each one of the stimuli. This degree of selectivity can be quantified by the 
mutual information between the set of stimuli and the responses p3 
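
$$ I = \sum_{s} P(s) \sum_{\mathbf{r}} P(\mathbf{r}|s) \log_2 \frac{P(\mathbf{r}|s)}{P(\mathbf{r})}, \qquad (1) $$

where $P(s)$ is the probability of presenting stimulus $s$, $P(\mathbf{r}|s)$ is the conditional probability of observing response $\mathbf{r}$ when stimulus $s$ is shown, and

$$ P(\mathbf{r}) = \sum_{s} P(s)\, P(\mathbf{r}|s). \qquad (2) $$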



The mutual information $I$ characterizes the mapping between the $p$ stimuli and the response space, and represents the amount of information conveyed by $\mathbf{r}$ about which of the $p$ stimuli was shown. If each stimulus evokes a unique set of responses, i.e. no two different stimuli induce the same response, then Eq. (1) reduces to the entropy of the stimulus set, which for equiprobable stimuli is $\log_2 p$. On the other hand, if a response $\mathbf{r}$ may be evoked by more than one stimulus, the mutual information is less than the entropy of the stimuli. In the extreme case where the responses are independent of the stimulus shown, $I = 0$.
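
As an illustration of the limiting cases just described, the following minimal Python sketch evaluates the standard mutual information between a discrete stimulus set and a discrete response set. The function name `mutual_information` and the table layout (one row of conditional probabilities per stimulus) are choices made here for illustration only; they are not taken from the paper.

```python
import math


def mutual_information(p_stim, p_resp_given_stim):
    """Mutual information in bits between stimuli and responses.

    p_stim[s] is P(s); p_resp_given_stim[s][r] is P(r|s).
    """
    n_resp = len(p_resp_given_stim[0])
    # Marginal response probability P(r) = sum_s P(s) P(r|s).
    p_resp = [
        sum(p_stim[s] * p_resp_given_stim[s][r] for s in range(len(p_stim)))
        for r in range(n_resp)
    ]
    info = 0.0
    for s, ps in enumerate(p_stim):
        for r, prs in enumerate(p_resp_given_stim[s]):
            if ps > 0.0 and prs > 0.0:  # terms with zero probability contribute nothing
                info += ps * prs * math.log2(prs / p_resp[r])
    return info


# Four equiprobable stimuli.
uniform = [0.25] * 4
# (a) every stimulus evokes its own response: I = log2(4) = 2 bits.
unique = [[1.0 if r == s else 0.0 for r in range(4)] for s in range(4)]
# (b) responses independent of the stimulus: I = 0.
independent = [[0.25] * 4 for _ in range(4)]
# (c) stimuli 0 and 1 share a response: I lies strictly between 0 and 2 bits.
shared = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Case (c) shows the intermediate situation: two stimuli cannot be told apart, and the information drops below the entropy of the stimulus set.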

In Fig. 1 we show the mutual information extracted from neural responses from the inferior temporal cortex of a
macaque when exposed to $p$ visual stimuli [8]. Diamonds correspond to $p = 20$, squares to $p = 9$ and triangles to
$p = 4$. The graph is plotted as a function of the number of neurons considered. Initially, the information rises linearly.
As $N$ grows, the increase of $I(N)$ slows down, apparently saturating at some asymptotic value compatible with the
theoretical maximum $\log_2 p$.

FIG. 1: Mutual information extracted from the activity of inferior temporal cortical neurons of a macaque when exposed to $p$ visual stimuli. Diamonds correspond to $p = 20$, squares to $p = 9$ and triangles to $p = 4$. The graph is plotted as a function of the number of neurons considered, after averaging over all the possible permutations of neurons. The theoretical maximum is, in each case, $\log_2 p$ = 4.32 bits, 3.17 bits and 2 bits, respectively. The full line shows a fit of Eq. (3) to the case of $p = 20$.

The behavior shown in Fig. 1 is quite commonly observed in other experiments of the same type [10, 11]. From the theoretical point of view, different conclusions have been drawn, over the years, from these curves. Obviously, the saturation in itself implies that, after a while, adding more and more neurons provides no more than redundant information. Gawne and Richmond [1] have considered a simple model which yields an analytical expression for $I(N)$ under the assumption that each neuron provides a fixed amount of information, $I(1)$, and that a fixed fraction of such an amount, $y$, is redundant with the information conveyed by any other neuron. The model yields $I(\infty) = I(1)/y$. Rolls et al. [8] have considered a more constrained model that in addition assumes that $y = I(1)/\log_2 p$. Later it was shown that this is, in fact, the mean pairwise redundancy if the information provided by different cells has a random overlap. In this kind of phenomenological description, the information provided by a population of $N$ cells reads

$$ I(N) = \log_2 p \left[ 1 - (1 - y)^N \right]. \qquad (3) $$
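
The phenomenological redundancy model described above can be sketched numerically. The sketch below assumes one common reading of the model, namely that each added cell contributes a fraction $(1 - y)$ of what the previous one contributed, so that the contributions form a geometric series whose sum tends to $I(1)/y$. The function names are hypothetical and the closed form in `constrained_model` follows from that assumption together with $y = I(1)/\log_2 p$.

```python
import math


def redundancy_model(n_cells, info_single, redundancy):
    """I(N) in bits, assuming the i-th added cell contributes
    info_single * (1 - redundancy)**i (geometric series)."""
    return sum(info_single * (1.0 - redundancy) ** i for i in range(n_cells))


def constrained_model(n_cells, info_single, n_stimuli):
    """Same model with the extra constraint y = I(1) / log2(p),
    which makes the asymptote equal to log2(p)."""
    ceiling = math.log2(n_stimuli)
    y = info_single / ceiling
    return ceiling * (1.0 - (1.0 - y) ** n_cells)
```

With `info_single = 0.5` and `redundancy = 0.1`, the series approaches 5 bits, i.e. $I(1)/y$; with the constrained choice of $y$ the two functions agree for every $N$.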



It has also been suggested [8] that monitoring the linear rise for small $N$ may tell whether the representation of the stimuli is distributed or local. In a distributed scheme many neurons participate in coding for each stimulus. In a local representation, sometimes called grandmother-cell encoding, each stimulus is instead represented by the activation of just one or a very small number of equivalent neurons.

Here we present a theoretical analysis of the dependence of $I$ on $N$ for independent units. In contrast to the previous phenomenological description, we model the response of each neuron to every stimulus. In Sects. II and III we derive $I(N)$ for several choices of the single unit response probability. In Sect. IV we discuss the relation of the mutual information defined in Eq. (1) to an informational measure of retrieval accuracy. We end in Sect. V with some concluding remarks.



II. DISCRETE, NOISELESS UNITS

In what follows, we quantify the mean amount of information provided by $N$ units. To do so, the response of each unit to every stimulus is specified. From such responses, the mutual information is calculated using Eq. (1). Two types of models are considered. In this section we deal with discrete noiseless units, while in Sect. III we turn to continuous noisy ones.

We consider $N$ units responding to a set of stimuli. The response $r_i$ of unit $i$ is taken to vary in a discrete set of $f$ possible values. The states of the whole assembly of $N$ units are written as $\mathbf{r} \in \mathcal{R}$, where $\mathbf{r} = (r_1, \ldots, r_N)$. Throughout the paper, letters in bold stand for vectors in an $N$-dimensional space. The total number of states in $\mathcal{R}$ is therefore $f^N$.

The stimuli $\{s\}$ to be discriminated constitute a discrete set $\mathcal{S}$ of $p$ elements. For simplicity, we assume that they are all presented to the neural system with the same frequency, namely

$$ P(s) = \frac{1}{p}. \qquad (4) $$

In order to calculate the mutual information between $\mathcal{S}$ and $\mathcal{R}$ we assume that each stimulus has a representation in $\mathcal{R}$. In other words, for each stimulus $s$ there is a fixed $N$-dimensional vector $\mathbf{r}^s$. Superscripts label stimuli, while subscripts stand for units.

The fact that the neurons are noiseless means that the mapping between stimuli and responses is deterministic. That is to say, for every stimulus there is a unique response $\mathbf{r}^s$. Mathematically,

$$ P(\mathbf{r}|s) = \delta_{\mathbf{r}, \mathbf{r}^s}. \qquad (5) $$

Therefore, for every $s \in \mathcal{S}$ there is one and only one $\mathbf{r} \in \mathcal{R}$. The converse, however, is in general not true. If several stimuli happen to have the same representation, which may well be the case if too few units are considered, then a given $\mathbf{r}$ may come as a response to more than one stimulus. In order to provide a detailed description of the way the stimuli are associated to the responses, we define $S_{\mathbf{r}}$ as the number of stimuli whose representation is state $\mathbf{r}$. Clearly,

$$ \sum_{\mathbf{r}} S_{\mathbf{r}} = p, \qquad (6) $$

and

$$ P(\mathbf{r}) = \frac{S_{\mathbf{r}}}{p}. \qquad (7) $$

When the conditional probability (5) is inserted in (1), the sum on the responses can be carried out, since only a single vector $\mathbf{r} = \mathbf{r}^s$ gives a contribution. The mutual information reads

$$ I = \frac{1}{p} \sum_{s=1}^{p} \log_2 \frac{p}{S_{\mathbf{r}^s}} = \sum_{\mathbf{r}} \frac{S_{\mathbf{r}}}{p} \log_2 \frac{p}{S_{\mathbf{r}}}. \qquad (8) $$

Thus, $I$ is entirely determined by the way the stimuli are clustered in the response space. For example:

• Consider the case where all stimuli evoke the same response. This means that all the $\mathbf{r}^s$ coincide. Accordingly, $S_{\mathbf{r}^s} = p$ while all the other $S_{\mathbf{r}}$ vanish. There is no way the responses can give information about the identity of the stimuli, and $I = 0$.

• If every stimulus evokes its own distinctive response, no two $\mathbf{r}^s$ are equal. This means that $p$ of the $S_{\mathbf{r}}$ are equal to one, while the remaining ones vanish. The responses fully characterize the stimuli, and $I = \log_2 p$.

• Consider the case of even clustering, where the representations are evenly distributed among all the states of the system. This, or something close to it, may in fact happen when the number of representations is much larger than the number of states, $p \gg f^N$. Thus, $S_{\mathbf{r}} = p / f^N$ for all $\mathbf{r}$, and $I = \log_2 (f^N)$. This is the maximum amount of information that can be extracted when the set of stimuli has been partitioned into $f^N$ subsets, and the responses are only capable of identifying the subsets, but not individual stimuli.
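
The three examples above can be checked with a few lines of Python. The sketch assumes that, for noiseless units and equiprobable stimuli, the information equals the entropy of the cluster sizes, $I = \sum_{\mathbf{r}} (S_{\mathbf{r}}/p) \log_2 (p/S_{\mathbf{r}})$; the function name `clustering_information` and the tuple encoding of response vectors are illustrative choices.

```python
import math
from collections import Counter


def clustering_information(representations):
    """Mutual information (bits) for noiseless units and equiprobable stimuli.

    `representations` holds one hashable response vector per stimulus.
    """
    p = len(representations)
    counts = Counter(representations)  # S_r for every occupied state r
    return sum((s_r / p) * math.log2(p / s_r) for s_r in counts.values())


# All stimuli evoke the same response: I = 0.
all_same = [(0, 0)] * 5
# Eight stimuli, eight distinct responses of three binary units: I = log2(8).
all_distinct = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
# Even clustering: eight stimuli on one binary unit (f**N = 2 states): I = 1 bit.
even = [(0,)] * 4 + [(1,)] * 4
```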



A. A local coding scheme 



We now consider another example, namely that of a local coding scheme, sometimes called a system of grandmother cells. In 1972 Barlow proposed a single neuron doctrine for perceptual psychology [15]. If a system is organized so as to achieve as complete a representation as possible with the minimum number of active neurons, then at progressively higher levels of sensory processing fewer and fewer cells should be active. However, the firing of each one of these high level units should code for a very complex stimulus (as, for example, one's grandmother). The encoding of information in such a scheme is described as local.

Local coding schemes have been shown to have several drawbacks, such as their extreme fragility to damage of the participating units. Nevertheless, there are some examples in the brain of rather local strategies such as, for example, retinal ganglion cells (only activated by spots of light in a particular position of the visual field) or the rodent's hippocampal place cells (only responding when the animal is in a specific location in its environment [17]).

We now evaluate the mutual information in such a grandmother-cell scheme, making use of Eq. (8). For simplicity, we take the units to be binary ($f = 2$). We assume that each unit $j$ responds to a single stimulus $s(j)$. Let us take that response to be 1, and the response to any other stimulus to be 0. All units are taken to respond to one single stimulus and, at first, we take at most one responsive unit per stimulus. Thus, for the time being, $N \le p$.
FIG. 2: Mutual information $I$ as a function of the number of cells $N$, for different sizes of the set of stimuli, in the case of local encoding. For small $N$, the information rises linearly with a slope proportional to $1/p$. When $N = p - 1$, $I$ saturates at $\log_2 p$.

FIG. 3: Mutual information $I$ as a function of the number of recorded units $N$, once averaged over all the possible selections of $N$ cells picked from a pool of $M$ (the latter consisting of $M/p$ units responding to each stimulus). Different curves correspond to various values of $M$, and $p = 32$.



This particular choice for the representations means that out of the $2^N$ states of the response space, only a subset of $N + 1$ vectors is ever used. Actually, $S_{\mathbf{0}} = p - N$, while for all one-active-unit states $\mathbf{e}$, $S_{\mathbf{e}} = 1$. For the remaining responses, $S_{\mathbf{r}} = 0$. Therefore, the mutual information reads

$$ I = \frac{N}{p} \log_2 p + \frac{p - N}{p} \log_2 \left( \frac{p}{p - N} \right). \qquad (9) $$

In Fig. 2 we show the dependence of $I$ on the number of cells, for several values of $p$. It can be readily seen that for $N \ll p$

$$ I \approx \frac{N}{p} \left( \log_2 p + \frac{1}{\ln 2} \right). \qquad (10) $$

In the limit of large $p$ Eq. (10) coincides with the intuitive approximation

$$ I(N) = N I(1) = N \left[ \frac{1}{p} \log_2 p + \frac{p - 1}{p} \log_2 \left( \frac{p}{p - 1} \right) \right]. \qquad (11) $$



A linear rise in $I(N)$ means that different neurons provide different information, or, in other words, that there is no redundancy in the responses of the different cells. As seen in Fig. 2, this is, in fact, the case when $N$ is small and $p$ is large enough. When a cell does not respond, it is still providing some information, namely, that it is not recognizing its specific stimulus. When two cells are considered, a part of this non-specific information overlaps with the information conveyed by the second cell, when responding. In other words, if two cells respond to different stimuli then, when one of them is in state 1, the other is, for sure, in state 0. Therefore, strictly speaking, the information provided by different neurons in a grandmother-like encoding is not independent. However, in the limit $N/p \to 0$ the number of stimuli not evoking responses in any single cell is large enough to make the information approximately additive.

As $N$ approaches $p$, such an independence no longer holds, so the growth of $I(N)$ decelerates, and the curve approaches $\log_2 p$. For $N = p - 1$, the mutual information is exactly equal to $\log_2 p$, and remains constant when more units are added. In fact, $p - 1$ noiseless units are enough to accurately identify $p$ stimuli. If all $p - 1$ are silent, then the stimulus shown is the one represented by the missing unit.
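
The behavior of the local code can be reproduced with a short Python sketch. It assumes the grandmother-cell expression $I = (N/p) \log_2 p + ((p-N)/p) \log_2 [p/(p-N)]$, which follows from having $N$ singleton clusters and one cluster of $p - N$ stimuli in the silent state, and its small-$N$ form $(N/p)(\log_2 p + 1/\ln 2)$. The function names are hypothetical.

```python
import math


def local_coding_information(n_cells, p):
    """Information (bits) of n_cells binary grandmother cells,
    each tuned to a different one of p equiprobable stimuli."""
    if n_cells >= p:
        return math.log2(p)
    silent = p - n_cells  # stimuli that leave every recorded cell silent
    return (n_cells / p) * math.log2(p) + (silent / p) * math.log2(p / silent)


def small_n_approximation(n_cells, p):
    """Linear regime, valid for n_cells much smaller than p."""
    return (n_cells / p) * (math.log2(p) + 1.0 / math.log(2))
```

Evaluating `local_coding_information(p - 1, p)` returns $\log_2 p$, which illustrates that $p - 1$ noiseless cells already identify all $p$ stimuli.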

In a slightly more sophisticated approach, each unit can have any number of responses $f$. But as long as the conditional probability $P(r_j|s)$ is the same for all those $s$ that are not $s(j)$, Eq. (9) still holds.

It should be kept in mind that up to now we have considered the optimal situation, in that different units always respond to different stimuli. If several cells respond to the same stimulus, a probabilistic approach is needed since otherwise the growth of $I(N)$ depends on the order in which the units are taken. Averaging over all possible selections of $N$ cells from a pool of $M$ units (the whole set is such that there are $M/p$ cells allocated to each stimulus), the result shown in Fig. 3 is obtained. We have taken $p = 32$, and different curves correspond to various values of $M$. The probabilistic approach smooths the sharp behavior observed in Fig. 2. Actually, the asymptote $\log_2 p$ can only be reached when there is certainty that there are $p - 1$ units responding to different stimuli, that is, for $N = 1 + (p - 2)M/p$. However, it is readily seen that with $M/p$ as large as 5, the curves are already very near to the limit case of $M/p \to \infty$.
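
The averaging over selections of cells can be approximated by random sampling. The sketch below is a Monte Carlo estimate, not the exact average used for the figure; it assumes that cells tuned to the same stimulus give identical noiseless responses, so that only the number $k$ of distinct stimuli covered by the selection matters. The names `local_info` and `pooled_info`, the number of trials and the seed are illustrative.

```python
import math
import random


def local_info(k, p):
    """Information (bits) when k distinct stimuli each have a responsive cell."""
    if k >= p - 1:
        return math.log2(p)
    return (k / p) * math.log2(p) + ((p - k) / p) * math.log2(p / (p - k))


def pooled_info(n_cells, p, cells_per_stimulus, trials=2000, seed=0):
    """Average information over random selections of n_cells units from a
    pool holding cells_per_stimulus units for each of the p stimuli."""
    rng = random.Random(seed)
    pool = [s for s in range(p) for _ in range(cells_per_stimulus)]
    total = 0.0
    for _ in range(trials):
        chosen = rng.sample(pool, n_cells)         # draw without replacement
        total += local_info(len(set(chosen)), p)   # only distinct stimuli count
    return total / trials
```

With one cell per stimulus the estimate reduces to the deterministic local-coding curve, and for $N = 1 + (p - 2)M/p$ every selection reaches $\log_2 p$.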

B. Distributed coding schemes 

As an alternative to the local coding scheme described above, we now treat the case of distributed encoding, 
ranging from sparsely to fully distributed. However, in doing so, we employ a different approach, namely, we average 
the information upon the details of the representation. 

Equation (8) implies that the amount of information that can be extracted from the responses depends on the specific representations of the $p$ stimuli. Since it is desirable to have a somewhat more general result, we define an averaged mutual information $\langle I \rangle$,

$$ \langle I \rangle = \sum_{\mathbf{r}^1, \ldots, \mathbf{r}^p} P_0(\mathbf{r}^1, \ldots, \mathbf{r}^p)\, I, \qquad (12) $$

where the mean is taken over a probability distribution $P_0(\mathbf{r}^1, \ldots, \mathbf{r}^p)$ of having the representations in positions $\mathbf{r}^1, \ldots, \mathbf{r}^p$. This distribution, of course, is determined by the coding scheme used by the system. By averaging the information we depart from the experimental situation, where the recorded responses strongly depend on the very specific set of stimuli chosen. But, in return, the resulting information characterizes, more generally, the way neurons encode a certain type of stimuli, rather than the exact stimuli that have actually been employed.

We write $P_0$ as a product of single distributions for each representation,

$$ P_0(\mathbf{r}^1, \ldots, \mathbf{r}^p) = \prod_{s=1}^{p} P_1(\mathbf{r}^s). \qquad (13) $$

This implies that the representation of one item does not bias the probability distribution of the representation of any other. In this sense, we can say that Eq. (13) assumes that representations are independent from one another.

If in one particular experiment the set of stimuli is large enough to effectively sample $P_1(\mathbf{r}^s)$, the averaged information will be close to the experimental result.

We further assume that there is a probability distribution $\rho(r_j)$ that determines the frequency at which unit $j$ goes into state $r_j$ (or fires at rate $r_j$). If $\rho$ is strongly peaked at a particular state, which can always be taken as zero, the code is said to be sparse. On the contrary, a flat $\rho$ gives rise to a fully distributed coding scheme.

Finally, we assume that different units are independent. In other words, we factorize the probability that a given stimulus is represented by the state $\mathbf{r}$ as

$$ P_1(\mathbf{r}) = \prod_{j=1}^{N} \rho(r_j). \qquad (14) $$

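In order to average the information (8) we need to derive the probability that stimuli are clustered into any possible set of $\{S_{\mathbf{r}}\}$. Such a probability reads

$$ P(\{S\}) = \binom{p}{\{S\}} \prod_{\mathbf{r}} [P_1(\mathbf{r})]^{S_{\mathbf{r}}}, \qquad (15) $$

where

$$ \binom{p}{\{S\}} = \frac{p!}{\prod_{\mathbf{r}} S_{\mathbf{r}}!}. \qquad (16) $$

Therefore, the average mutual information may be written as

$$ \langle I \rangle = \sum_{\{S\}} P(\{S\})\, I. \qquad (17) $$

The summation runs over all sets $\{S\}$ such that $\sum_{\mathbf{r}} S_{\mathbf{r}} = p$. Replacing Eq. (8) in (17), we obtain

$$ \langle I \rangle = \sum_{\{S\}} P(\{S\}) \sum_{\mathbf{r}} \frac{S_{\mathbf{r}}}{p} \log_2 \left( \frac{p}{S_{\mathbf{r}}} \right). \qquad (18) $$

Rearranging the summation so as to explicitly separate out a single $S_{\mathbf{r}}$, one may write

$$ \langle I \rangle = \sum_{\mathbf{r}} \sum_{S_{\mathbf{r}}=1}^{p} \frac{p!}{S_{\mathbf{r}}!\,(p - S_{\mathbf{r}})!}\, [P_1(\mathbf{r})]^{S_{\mathbf{r}}}\, \frac{S_{\mathbf{r}}}{p} \log_2 \left( \frac{p}{S_{\mathbf{r}}} \right) A, \qquad (19) $$

where $A$ is the sum over all other $S$, namely

$$ A = \sum_{\{S_{\mathbf{r}'}\}} \binom{p - S_{\mathbf{r}}}{\{S_{\mathbf{r}'}\}} \prod_{\mathbf{r}' \neq \mathbf{r}} [P_1(\mathbf{r}')]^{S_{\mathbf{r}'}} = [1 - P_1(\mathbf{r})]^{p - S_{\mathbf{r}}}. \qquad (20) $$

Thus,

$$ \langle I \rangle = \sum_{\mathbf{r}} \sum_{S_{\mathbf{r}}=1}^{p} \frac{(p-1)!}{(S_{\mathbf{r}}-1)!\,[(p-1)-(S_{\mathbf{r}}-1)]!}\, [P_1(\mathbf{r})]^{S_{\mathbf{r}}}\, [1 - P_1(\mathbf{r})]^{p - S_{\mathbf{r}}} \log_2 \left( \frac{p}{S_{\mathbf{r}}} \right). \qquad (21) $$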



We now discuss two particular cases of Eq. (21). First, we take the encoding to be fully distributed, namely $\rho(r_j) = 1/f$. Therefore, $P_1(\mathbf{r}) = 1/f^N$. If this is replaced in the previous expression, we obtain

$$ \langle I \rangle_{\rm dis} = \sum_{S=1}^{p} \binom{p-1}{S-1} f^{-N(S-1)} \left( 1 - f^{-N} \right)^{p-S} \log_2 \left( \frac{p}{S} \right). \qquad (22) $$

It may be seen that the dependence of the information on $f$ and $N$ always involves the combination $f^N$. This means that neither the number of units nor the number of distinct firing rates of each unit is relevant in itself. Only the total number of states matters.

In Fig. 4 we plot the relation between $\langle I \rangle_{\rm dis}$ and $N$ for several values of $p$. Initially the information rises linearly with a slope only slightly dependent on $p$. As $N$ increases, $\langle I \rangle_{\rm dis}$ eventually saturates at $\log_2 p$. The limit cases are easily derived:

$$ \lim_{N \ln f \to 0} \langle I \rangle_{\rm dis} = N (p - 1) \ln f \, \log_2 \left( \frac{p}{p - 1} \right), \qquad (23) $$

$$ \lim_{N \to \infty} \langle I \rangle_{\rm dis} = \log_2 p - (p - 1) f^{-N}. \qquad (24) $$

If the number of stimuli is large, Eq. (23) becomes

$$ \lim_{N \ln f \to 0} \lim_{p \to \infty} \langle I \rangle_{\rm dis} = N \frac{p - 1}{p} \log_2 f. \qquad (25) $$

Notice that in contrast to the local coding scheme of Eq. (10), the initial slope of $I(N)$ hardly depends on $p$ (actually, it increases slightly with $p$). This makes the distributed encoding a highly efficient way to read out information about a large set of stimuli from the activity of just a few units.
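
The two limits of the fully distributed case can be checked numerically. The sketch below assumes that, with every state equally likely ($P_1 = f^{-N}$), the averaged information takes the binomial form $\sum_{S=1}^{p} \binom{p-1}{S-1} f^{-N(S-1)} (1 - f^{-N})^{p-S} \log_2 (p/S)$; this is our reading of the cluster-averaged expression, and the function name is hypothetical. A non-integer `n_units` is accepted only as a mathematical device to probe the regime $N \ln f \to 0$.

```python
import math


def distributed_information(n_units, p, f=2):
    """Averaged information (bits) for a fully distributed code with
    f equiprobable states per unit and p equiprobable stimuli."""
    q = float(f) ** (-n_units)  # probability of any single state, f**(-N)
    return sum(
        math.comb(p - 1, s - 1) * q ** (s - 1) * (1.0 - q) ** (p - s) * math.log2(p / s)
        for s in range(1, p + 1)
    )


def small_state_limit(n_units, p, f=2):
    """Linear rise for N ln f -> 0."""
    return n_units * (p - 1) * math.log(f) * math.log2(p / (p - 1))


def saturation_limit(n_units, p, f=2):
    """Exponential approach to log2(p) for large N."""
    return math.log2(p) - (p - 1) * float(f) ** (-n_units)
```

The information depends on `f` and `n_units` only through `f ** n_units`, so, for instance, four binary units and two four-state units give the same value.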

FIG. 4: Mean mutual information $\langle I \rangle_{\rm dis}$ as a function of the number of neurons $N$ for several values of $p$. Initially the information rises linearly with a slope only slightly depending on $p$. As $N$ increases, $\langle I \rangle_{\rm dis}$ eventually saturates at $\log_2 p$.

As opposed to the fully distributed case, a sparse distributed encoding is now considered, with $f = 2$, $\rho(1) = q$, $\rho(0) = 1 - q$ and $q \ll 1$. This choice is again a binary case, but with one response much more probable than the other. As a consequence, the most likely representations in $\mathcal{R}$ space are those with either zero or at most one active neuron. In fact, $P_1(\mathbf{r}^s = \mathbf{0}) = (1 - q)^N$, whereas if the representation is a one-active-unit state $\mathbf{e}$, $P_1(\mathbf{e}) = q (1 - q)^{N-1}$. The probability of all other representations is of higher order in $q$.

Accordingly, to first order in $q$, we only consider the combinations of $p$ representations with at least $p - 1$ of them in state $\mathbf{r}^s = \mathbf{0}$. These are the only configurations with a probability $P_0$ at most linear in $q$. More precisely, the probability of representing all $p$ stimuli with the same state $\mathbf{r} = \mathbf{0}$ is $P_0(\mathbf{0}, \mathbf{0}, \ldots, \mathbf{0}) = [P_1(\mathbf{0})]^p \approx 1 - Npq$. In the same way, the probability of having $p - 1$ stimuli in $\mathbf{0}$ and a single one in a given one-active-unit state is, to first order, $q$. There are $N$ different possible one-active-unit states, and any one of the $p$ stimuli can be in such a state. Taking all this into account, we find that up to the first order in $Npq$,

$$ \langle I \rangle_{\rm spa} = N p q \left[ \frac{p - 1}{p} \log_2 \left( \frac{p}{p - 1} \right) + \frac{1}{p} \log_2 p \right]. \qquad (26) $$

Expanding this expression for large $p$, we obtain

$$ \lim_{p \to \infty} \langle I \rangle_{\rm spa} = N q\, \frac{1 + \ln p}{\ln 2}. \qquad (27) $$

This means that from the experimental measurement of the slope of $\langle I(N) \rangle$ it is possible to extract the sparseness of an equivalent binary model, which can be compared with a direct measurement of the sparseness. If the number of stimuli cannot be considered large, the whole of Eq. (26) can be used to derive a value for $q$.

It should be noticed that if $q = 1/p$, Eq. (26) coincides with the expression (11) for a grandmother-like encoding. This makes sense, since $q = 1/p$ implies that, on average, any one unit is activated by a single pattern. In short, it corresponds to a probabilistic description of the localized encoding. Notice, though, that $q = 1/p$ is outside the range of validity of our limit $Npq \ll 1$.
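
The first-order sparse result can be compared with a direct evaluation of the cluster-averaged information. The sketch below assumes independent binary units with activation probability $q$, groups the $2^N$ states by their number $k$ of active units (all states with the same $k$ share the probability $q^k (1-q)^{N-k}$), and uses the binomial form of the average over cluster sizes. The function names are hypothetical, and the first-order formula is written as $N q\, [\log_2 p + (p-1) \log_2 (p/(p-1))]$.

```python
import math


def averaged_information_binary(n_units, p, q):
    """Cluster-averaged information (bits) for independent binary units
    that are active with probability q, and p equiprobable stimuli."""
    total = 0.0
    for k in range(n_units + 1):  # k = number of active units in the state
        p1 = q ** k * (1.0 - q) ** (n_units - k)
        per_state = sum(
            math.comb(p - 1, s - 1) * p1 ** s * (1.0 - p1) ** (p - s) * math.log2(p / s)
            for s in range(1, p + 1)
        )
        total += math.comb(n_units, k) * per_state  # number of states with k active units
    return total


def sparse_first_order(n_units, p, q):
    """First-order approximation, valid for N * p * q much smaller than 1."""
    return n_units * q * (math.log2(p) + (p - 1) * math.log2(p / (p - 1)))
```

For `n_units = 2`, `p = 10` and `q = 1e-4` (so that $Npq = 2 \times 10^{-3}$) the two functions agree to within a fraction of a percent, and the agreement degrades as $Npq$ grows.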



III. CONTINUOUS, NOISY NEURONS 



In this section we turn to a more realistic description of the single neuron responses. Specifically, we allow the states $r_j$ to take any real value. Therefore, the response space $\mathcal{R}$ is now $\mathbb{R}^N$. In addition, we depart from the deterministic relationship between stimuli and responses. This means that upon presentation of stimulus $s$, there is no longer a unique response. Instead, the response vector $\mathbf{r}$ is most likely centered at a particular $\mathbf{r}^s$, and shows some dispersion to nearby vectors. The aim is to calculate the mutual information between the responses and the stimuli requiring as little as possible from the conditional probability $P(\mathbf{r}|s)$. A single parameter $\sigma$ is introduced as a measure of the noise in the representation. Thus,

$$ P(\mathbf{r}|s) = \prod_{j=1}^{N} \frac{\exp\left[ -(r_j - r_j^s)^2 / 2\sigma^2 \right]}{\sqrt{2\pi\sigma^2}}, \qquad (28) $$

where the index $s$ takes values from 1 to $p$. The conditional probability depends on the distance between the actual response $\mathbf{r}$ and a fixed vector $\mathbf{r}^s \in \mathcal{R}$, which is the mean response of the system to stimulus $s$. There is one such $\mathbf{r}^s$ for every element in $\mathcal{S}$. The choice of Gaussian functions is only to keep the description simple and analytically tractable. By factorizing $P(\mathbf{r}|s)$ into a product of one-component probabilities, an explicit assumption about the independence of the neurons is being made.

Figure 5 shows a numerical evaluation of the information (1), when the probability $P(\mathbf{r}|s)$ is as in (28). The information, just as in the previous section, has been averaged over many selections of the representations $\mathbf{r}^s$. The curve is a function of the number of neurons considered, $N$. Different lines correspond to different sizes of the set of stimuli, while in (a) $\sigma = \lambda/2$, and in (b) $\sigma = \lambda$, where $\lambda$ is a parameter quantifying the mean discriminability among representations, to be defined precisely later. Just as in the discrete distributed case, we observe an initial linear rise and a saturation at $\log_2 p$. Moreover, and pretty much as in the experimental situation of Fig. 1, the initial slope does not seem to depend strongly on the number of stimuli, at least for large values of the noise $\sigma$. In what follows, an analytical study of these numerical results is carried out. In particular, the relevant parameters determining the shape of $I(N)$ are identified.

We write the mutual information as

$$ I = H_1 - H_2, \qquad (29) $$

where

$$ H_1 = -\frac{1}{p} \sum_{s=1}^{p} \int d\mathbf{r}\, P(\mathbf{r}|s) \log_2 \left[ \frac{1}{p} \sum_{s'=1}^{p} P(\mathbf{r}|s') \right] = -\int d\mathbf{r}\, P(\mathbf{r}) \log_2 [P(\mathbf{r})] \qquad (30) $$

is the total entropy of the responses, and

$$ H_2 = -\frac{1}{p} \sum_{s=1}^{p} \int d\mathbf{r}\, P(\mathbf{r}|s) \log_2 [P(\mathbf{r}|s)] \qquad (31) $$

is the conditional entropy of $P(\mathbf{r}|s)$, averaged over $s$.
$H_2$ can be easily calculated. It reads

$$ H_2 = \frac{N}{2 \ln 2} \left[ 1 + \ln(2\pi\sigma^2) \right]. \qquad (32) $$

It is therefore linear in $N$. This stems from the independence of the units, since the entropy of the response space increases linearly with its dimension. It does not depend on the location of the representations $\mathbf{r}^s$, and it is a growing function of the noise $\sigma$.
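
The closed form of the conditional entropy is the differential entropy of $N$ independent Gaussians of variance $\sigma^2$, expressed in bits. The short Python sketch below checks the single-unit case by midpoint quadrature; the function names, the integration width and the number of steps are arbitrary illustrative choices.

```python
import math


def conditional_entropy_formula(n_units, sigma):
    """H_2 in bits: N / (2 ln 2) * [1 + ln(2 pi sigma^2)]."""
    return n_units / (2.0 * math.log(2)) * (1.0 + math.log(2.0 * math.pi * sigma ** 2))


def gaussian_entropy_numeric(sigma, steps=20000, width=10.0):
    """Differential entropy (bits) of a one-dimensional Gaussian,
    integrated with the midpoint rule over [-width*sigma, width*sigma]."""
    lo = -width * sigma
    dx = 2.0 * width * sigma / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        dens = math.exp(-x * x / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)
        total -= dens * math.log2(dens) * dx
    return total
```

Because the units are independent, the entropy of $N$ units is simply $N$ times the single-unit value, which is why the formula is linear in $N$.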

In Appendix A we solve the integral in $\mathbf{r}$ of $H_1$ using the replica method. We obtain

$$ H_1 = -\frac{1}{\ln 2} \lim_{n \to 0} \frac{1}{n} \left\{ \frac{1}{p^{n+1} (2\pi\sigma^2)^{Nn/2} (n+1)^{N/2}} \sum_{\{K\}} \binom{n+1}{\{K\}} \prod_{j=1}^{N} \exp\left[ -\frac{1}{4(n+1)\sigma^2} \sum_{\ell=1}^{p} \sum_{m=1}^{p} K_\ell K_m \left( r_j^\ell - r_j^m \right)^2 \right] - 1 \right\}, \qquad (33) $$

where $\{K\}$ now stands for the set $\{K_1, \ldots, K_p\}$ specifying how many replicas are representing each pattern. The summation in $\{K\}$ runs over all sets of $K$ such that $\sum_{s=1}^{p} K_s = n + 1$. The symbol in brackets is defined in (16). Equation (33) shows that the information depends explicitly on the ratio between all the possible differences $|\mathbf{r}^\ell - \mathbf{r}^m|$ and the noise $\sigma$. In other words, the capacity to determine which stimulus is being shown is given by a signal-to-noise ratio, characterizing the discriminability of the responses.

The mutual information $I$ characterizes the selectivity of the correspondence between stimuli and responses. If the distance between any two vectors $|\mathbf{r}^\ell - \mathbf{r}^m|$ is much greater than the noise $\sigma$, then the mapping is (almost) injective. Thus, in this limit the mutual information approaches its maximal value, $\log_2 p$.

If, on the other hand, the noise level in $P(\mathbf{r}|s)$ is enough to allow for some vectors $\mathbf{r}$ to be evoked with appreciable probability by more than one stimulus, the mutual information decreases. In this sense, $I$ can be interpreted as a comparison between the noise in $P(\mathbf{r}|s)$ and the distance between any two mean responses. For a specific choice of the representations, the distance between any two of them is a nonlinear function of their components. Therefore, in general, even though Eq. (28) implies that different units are independent, it is not possible to write $I$ as a sum over units of single-unit information.
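Just as before, we now average the mutual information (1) over a probability distribution $P_0(\mathbf{r}^1, \ldots, \mathbf{r}^p)$ of the representations $\mathbf{r}^1, \ldots, \mathbf{r}^p$, namely

$$ \langle I \rangle = \int \prod_{s=1}^{p} d\mathbf{r}^s\, P_0(\mathbf{r}^1, \ldots, \mathbf{r}^p)\, I. \qquad (34) $$

Under the assumption that the responses to different stimuli are independent, $P_0$ reads

$$ P_0(\mathbf{r}^1, \ldots, \mathbf{r}^p) = \prod_{s=1}^{p} P_1(\mathbf{r}^s). \qquad (35) $$

Adding the requirement of independent units,

$$ P_1(\mathbf{r}^s) = \prod_{j=1}^{N} \rho(r_j^s). \qquad (36) $$

By replacing the average (12) in the separation (29) we write

$$ \langle I \rangle = \langle H_1 \rangle - H_2, \qquad (37) $$

since $H_2$ does not depend on the vectors $\mathbf{r}^s$.

So we now turn to the calculation of $\langle H_1 \rangle$, namely

$$ \langle H_1 \rangle = -\frac{1}{\ln 2} \lim_{n \to 0} \frac{1}{n} \left\{ \frac{1}{(n+1)^{N/2} (2\pi\sigma^2)^{Nn/2} p^{n+1}} \sum_{\{K\}} \binom{n+1}{\{K\}} \langle A_{\{K\}} \rangle^N - 1 \right\}, \qquad (38) $$

where

$$ \langle A_{\{K\}} \rangle = \int \prod_{s=1}^{p} dr^s\, \rho(r^s) \exp\left[ -\frac{1}{4(n+1)\sigma^2} \sum_{\ell=1}^{p} \sum_{m=1}^{p} K_\ell K_m \left( r^\ell - r^m \right)^2 \right]. \qquad (39) $$
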

The main step forward introduced by the average in (12) is that now $\langle H_1 \rangle$ is symmetric under the exchange of any two responses, or any two neurons. In contrast, before the averaging process, the location of every single response by every single unit was relevant.

The limit in Eq. (38) can be calculated in some particular cases. In the first place, we analyze the large $N$ limit. From Eq. (39) it is clear that $\langle A_{\{K\}} \rangle \le 1$. The equality holds, in fact, only when there is a single $K$ different from zero. In the calculation of $\langle H_1 \rangle$, as stated in Eq. (38), $\langle A_{\{K\}} \rangle$ appears to the $N$-th power. Therefore, when $N \to \infty$ only the terms with $\langle A_{\{K\}} \rangle = 1$ give a non-vanishing contribution. There are $p$ such terms. When the sum in (38) is replaced by $p$, it may be shown that once more $\langle I \rangle = \log_2 p$.

In the following two subsections we compute $\langle I(N) \rangle$ for both large and small values of the noise $\sigma$.



A. Information in the large noise limit 



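We now make the assumption that the noise $\sigma$ is much larger than some average width of $\rho(r)$. In other words, we suppose $\sigma \gg |r^\ell - r^m|$, for all $r^\ell$ and $r^m$ with non-vanishing probability. In this case, the exponential in (39) may be expanded in Taylor series. Up to the second order,

$$ \exp\left[ -\frac{1}{4(n+1)\sigma^2} \sum_{\ell=1}^{p} \sum_{m=1}^{p} K_\ell K_m (r^\ell - r^m)^2 \right] \approx 1 - \frac{1}{4(n+1)\sigma^2} \sum_{\ell=1}^{p} \sum_{m=1}^{p} K_\ell K_m (r^\ell - r^m)^2 + \frac{1}{2} \left[ \frac{1}{4(n+1)\sigma^2} \sum_{\ell=1}^{p} \sum_{m=1}^{p} K_\ell K_m (r^\ell - r^m)^2 \right]^2. \qquad (40) $$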



If only the constant term is considered, the integral in Eq. (39) becomes the normalization condition for $\rho$. Thus, the sums in (38) give $p^{n+1}$, and it is readily seen that $\langle H_1 \rangle$ exactly cancels $H_2$. As expected, in the limit $\sigma^2 \to \infty$ the mutual information vanishes.

FIG. 5: Results of the numerical evaluation of the mutual information for continuous noisy neurons, where $p$ is the number of stimuli in the set. In (a) $\sigma = \lambda/2$, and in (b) $\sigma = \lambda$.



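The next order of approximation is to consider the expansion (40) up to the linear term. Thus, the integral in (39) becomes

$$ \langle A_{\{K\}} \rangle = 1 - \frac{\lambda^2}{4(n+1)\sigma^2} \sum_{\ell=1}^{p} \sum_{m \neq \ell} K_\ell K_m, \qquad (41) $$

where

$$ \lambda^2 = \int dr^1\, dr^2\, \rho(r^1)\, \rho(r^2)\, (r^1 - r^2)^2 \qquad (42) $$

is the parameter quantifying the discriminability among representations, and appearing in Fig. 5. We have now gained a more precise insight into the large $\sigma$ limit. It stands for taking $\sigma \gg \lambda$.

Since in Eq. (38) $\langle A_{\{K\}} \rangle$ appears to the $N$-th power, in order to proceed further we have to estimate the size of $N\lambda^2/\sigma^2$. We first consider the small $N$ limit and assume, to start with, that $N\lambda^2/\sigma^2 \ll 1$. Thus, we may expand

$$ \langle A_{\{K\}} \rangle^N \approx 1 - \frac{N \lambda^2}{4(n+1)\sigma^2} \sum_{\ell=1}^{p} \sum_{m \neq \ell} K_\ell K_m. \qquad (43) $$

In Appendix B we calculate the sums in (43), thus obtaining $\langle H_1 \rangle$. When the result is replaced in (38) we get

$$ \langle I \rangle = \frac{N}{\ln 2}\, \frac{p - 1}{p} \left( \frac{\lambda}{2\sigma} \right)^2. \qquad (44) $$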

For a large amount of noise, the information rises linearly with the number of neurons. This dependence should be compared with Eq. (25), in the discrete distributed case. The two expressions coincide if the number of discrete states $f$ is identified with $\exp(\lambda^2/4\sigma^2)$. Therefore, as regards the mutual information, a dispersion $\sigma$ in the representation is equivalent to having a number $\exp(\lambda^2/4\sigma^2)$ of distinguishable discrete responses. Notice that both the noisy, continuous and the discrete, deterministic approaches show the same dependence on the number of representations.

Regarding the dependence on $\sigma$, it is readily seen that as the noise decreases, the slope of $I$ increases. In other words, every single neuron provides a larger amount of information. Since the mutual information saturates at $\log_2 p$ for $N \to \infty$, a small value of $\sigma$ implies that the ceiling is quickly reached. As a consequence, the assumption $N \ll \sigma^2/\lambda^2$ can now be more precisely stated as $N \ll (\sigma^2/\lambda^2) \log_2 p$. In this regime, linearity holds.
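
The large-noise slope can be checked numerically in the simplest setting. The sketch below considers a single Gaussian unit ($N = 1$) and, as an assumption made only for this illustration, a two-point distribution $\rho$ with equally likely mean responses 0 and 1, for which $\lambda^2 = 1/2$. The average over representations is then an exact enumeration of the $2^p$ assignments, and the information of each assignment is obtained by midpoint quadrature. All function names are hypothetical.

```python
import itertools
import math


def gaussian(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)


def info_single_unit(means, sigma, steps=4000):
    """Mutual information (bits) between p equiprobable stimuli and one
    Gaussian unit with mean response means[s], by midpoint quadrature."""
    p = len(means)
    lo, hi = min(means) - 8.0 * sigma, max(means) + 8.0 * sigma
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        cond = [gaussian(x, m, sigma) for m in means]
        marg = sum(cond) / p
        for c in cond:
            if c > 0.0:
                total += (c / p) * math.log2(c / marg) * dx
    return total


def averaged_info(p, sigma, levels=(0.0, 1.0)):
    """Average over all assignments of the (equally likely) levels to p stimuli."""
    configs = list(itertools.product(levels, repeat=p))
    return sum(info_single_unit(c, sigma) for c in configs) / len(configs)


def linear_prediction(p, sigma, lam2, n_units=1):
    """Large-noise slope: N (p - 1) / p * lambda^2 / (4 sigma^2 ln 2)."""
    return n_units * (p - 1) / p * lam2 / (4.0 * sigma ** 2 * math.log(2))
```

With `sigma = 3.0` (that is, $\sigma \approx 4\lambda$) the enumerated average and the linear prediction differ by roughly one percent; the small deficit of the exact value is the onset of the saturation discussed next.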

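As $N$ increases, saturation effects become evident, and the mutual information is no longer linear. The first hint of the presence of an asymptote at $\log_2 p$ is given by the quadratic contribution to $I(N)$. In order to describe it, the whole of expansion (40) must be replaced in Eq. (39). Carrying out the integral in $r^1, \ldots, r^p$,

$$ \langle A_{\{K\}} \rangle = 1 - \frac{\lambda^2}{4\sigma^2 (n+1)} \sum_{\ell=1}^{p} \sum_{m \neq \ell} K_\ell K_m + \frac{1}{32 \sigma^4 (n+1)^2} \sum_{\ell, m = 1}^{p} \sum_{\ell', m' = 1}^{p} K_\ell K_m K_{\ell'} K_{m'} \left\langle (r^\ell - r^m)^2 (r^{\ell'} - r^{m'})^2 \right\rangle, \qquad (45) $$

where

$$ \left\langle (r^\ell - r^m)^2 (r^{\ell'} - r^{m'})^2 \right\rangle = \int \prod_{s=1}^{p} \rho(r^s)\, dr^s\, (r^\ell - r^m)^2 (r^{\ell'} - r^{m'})^2. \qquad (46) $$

Extracting the sums from the integral, the limit in Eq. (38) can be solved, and

$$ \langle I \rangle = \frac{N}{\ln 2}\, \frac{p - 1}{p} \left[ \frac{\lambda^2}{4\sigma^2} + \frac{C}{2 (4\sigma^2)^2} \right] - \frac{N^2}{\ln 2} \left( \frac{\lambda^2}{4\sigma^2} \right)^2 \frac{p - 1}{p^2}. \qquad (47) $$

Here,

$$ C = \frac{2\lambda^4}{p} - 2\Lambda_1 \left( 1 - \frac{2}{p} + \frac{2}{p^2} \right) + 4\Lambda_2\, \frac{p - 2}{p} \left( 1 - \frac{2}{p} \right) - 2\Lambda_3\, \frac{(p-2)(p-3)}{p^2}, \qquad (48) $$

with

$$ \begin{aligned} \Lambda_1 &= \int dr^1\, dr^2\, \rho(r^1)\, \rho(r^2)\, (r^1 - r^2)^4, \\ \Lambda_2 &= \int dr^1\, dr^2\, dr^3\, \rho(r^1)\, \rho(r^2)\, \rho(r^3)\, (r^1 - r^2)^2 (r^1 - r^3)^2, \\ \Lambda_3 &= \int dr^1\, dr^2\, dr^3\, dr^4\, \rho(r^1)\, \rho(r^2)\, \rho(r^3)\, \rho(r^4)\, (r^1 - r^2)^2 (r^3 - r^4)^2. \end{aligned} \qquad (49) $$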



Our numerical simulations corroborate that if a quadratic function is fitted to the initial rise of $I(N)$, the coefficients accompanying $N$ and $N^2$ depend on $p$ and $\sigma$ just as predicted by Eq. (47).



B. The limit of vanishing noise 

In the first place, we take $\sigma \to 0$. If the conditional probability (28) is replaced by a $\delta$-function, it is readily seen
that $I = \log_2 p$.

In Appendix C we show that for small, but not vanishing, values of the noise $\sigma$, the mutual information is expected
to grow as



(I) =log 2 (p) 



where 



Bo 



log 2 p 



p 2 (r) dr. 



(50) 



(51) 



In order to corroborate this result, we have fit a function of the form $\log_2(p)\,[1 - a \exp(-bN)]$ to the numerical
evaluation of Eq. (37). In Fig. 6 we show the dependence of $a$ and $b$ on $\sigma$ and $p$. We observe that coefficient
$a$ shows a dependence on the noise $\sigma$, in contrast to what is predicted by Eq. (50). It is also in contrast to the
prediction of the phenomenological model leading to Eq. (3), where $a = 1$. In addition, $b$ shows a variation with the
number of stimuli $p$. Thus, although it is very easy to calculate the mutual information when $\sigma$ is exactly equal to 0,
we have not been able to derive analytically the approach to the $\log_2 p$ limit, as $\sigma \to 0$.

FIG. 6: Dependence of the coefficients $a$ and $b$, extracted from numerical evaluations of the mutual information, on the
parameters $p$ and $\sigma$.
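One simple way to extract $a$ and $b$ from a curve $I(N)$ is a straight-line fit to $\ln[1 - I/\log_2 p] = \ln a - bN$. The sketch below uses this log-linear regression, which is an assumption made here for illustration and not necessarily the fitting procedure used for Fig. 6. The data are synthetic and noise-free, generated with hypothetical values of $p$, $a$ and $b$.

```python
import math

def fit_saturating_exponential(n_values, info_values, p):
    """Least-squares fit of ln(1 - I/log2 p) = ln a - b N; returns (a, b)."""
    ceiling = math.log2(p)
    xs, ys = [], []
    for n, info in zip(n_values, info_values):
        gap = 1.0 - info / ceiling
        if gap > 0.0:  # points at or above the ceiling cannot be log-transformed
            xs.append(float(n))
            ys.append(math.log(gap))
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic data with hypothetical parameters (not taken from the paper).
p, a_true, b_true = 20, 0.8, 0.15
n_values = list(range(1, 31))
info_values = [math.log2(p) * (1.0 - a_true * math.exp(-b_true * n)) for n in n_values]
a_fit, b_fit = fit_saturating_exponential(n_values, info_values, p)
print(a_fit, b_fit)
```

On real, noisy estimates of $I(N)$ the logarithm amplifies errors near the ceiling, so a direct nonlinear fit may be preferable there.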



IV. A RELATED INFORMATIONAL MEASURE OF ACCURACY 



Up until now, we have considered the mutual information of Eq. (1), a quantifier of the capacity with which a given
group of units can represent a fixed set of $p$ stimuli. This is a measure of direct relevance to neuronal recording
experiments. A somewhat different information measure has been used in analysing mathematical network models,
in particular models of memory storage and retrieval. We would like to clarify the relationship between the two
measures.

Consider the variability with which a typical stimulus is represented, which in a mathematical model might be
described by a formula as simple as Eq. (28). There, $\mathbf{r}$ is the response during a trial, while $\mathbf{r}^s$ is the average response
across trials with the same stimulus. The average variability may be quantified by the mutual information between $\mathbf{r}$
and $\mathbf{r}^s$,



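$$I = \int d\mathbf{r}^s\, P(\mathbf{r}^s) \int d\mathbf{r}\, P(\mathbf{r}|\mathbf{r}^s)\, \log_2 \left[ \frac{P(\mathbf{r}|\mathbf{r}^s)}{P(\mathbf{r})} \right], \tag{52}$$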



where $\mathbf{r}^s$ is taken to span the space of average responses, described by the probability distribution $P(\mathbf{r}^s)$. In a
different model, $\mathbf{r}^s$ might be the first response produced, and $\mathbf{r}$ the second, or any successive response; in yet other
models [18, 19, 20, 21], $\mathbf{r}^s$ might be the stored representation of a memory item, and $\mathbf{r}$ the representation emerging
when the item is being retrieved. In all such cases, one need not refer to a discrete set of $p$ stimuli, but only to a
probability distribution $P(\mathbf{r}^s)$ (and, of course, to a conditional probability distribution $P(\mathbf{r}|\mathbf{r}^s)$). This measure of
accuracy is simply related to the mutual information we have considered in this paper: it is given by its $p \to \infty$
limit. In particular, the initial linear rise of $I$ with $N$ is the only regime relevant to the accuracy measure, which for
independent units is always purely linear in $N$.

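Let us see this in formulas. Just as before, we assume that $P(\mathbf{r}^s)$ factorizes as

$$P(\mathbf{r}^s) = \prod_{j=1}^{N} \rho(r_j^s). \tag{53}$$

The equivalent of (2) is now

$$P(\mathbf{r}) = \int d\mathbf{r}^s\, P(\mathbf{r}^s)\, P(\mathbf{r}|\mathbf{r}^s). \tag{54}$$

In Appendix D we show that

$$I = \frac{N}{\ln 2}\, \frac{\lambda^2}{4\sigma^2}. \tag{55}$$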

In the derivation of Eq. (55) no assumption of small $N$ has been made. By comparison with Eq. (44) we see that,
indeed, the information measure (52) introduced in this section coincides with the initial rise of the information about
which stimulus is being shown (Sect. III), when the latter is calculated for a large number of stimuli.
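The comparison can be written out in one line. The expression below assumes that the initial rise of Sect. III has the form that follows from Appendix B, where $\langle H_1 \rangle$ in (B5) differs from the noise entropy only by the $\lambda$-dependent term; this form is a reconstruction, not a quotation of Eq. (44).

```latex
\langle I \rangle \simeq \frac{N}{\ln 2}\,\frac{p-1}{p}\,\frac{\lambda^2}{4\sigma^2}
\;\xrightarrow[\;p \to \infty\;]{}\;
\frac{N}{\ln 2}\,\frac{\lambda^2}{4\sigma^2}
```

Under this assumption the number of stimuli enters the slope only through the factor $(p-1)/p$, which for $p = 10$ is already 0.9.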

V. SUMMARY AND DISCUSSION

The capacity with which a system of $N$ independent units can code for a set of $p$ stimuli has been studied. More
precisely, the growth of the mutual information $I$ between stimuli and responses has been calculated, for different
models of the neural responses. In all these models, the units were supposed to operate independently. That is to say,
the conditional probability of response $\mathbf{r}$ given stimulus $s$ is always a product of single-unit conditional probabilities.
Of course, the fact that neurons operate independently does not mean that they provide independent information.
As stated in Eq. (^9|), the mutual information can always be separated into the difference between the entropy of the
responses ($H_1$) and the averaged stimulus-specific entropy ($H_2$), sometimes called noise entropy. For independent units,
$H_2$ is always linear in $N$. However, the factorization of the conditional probabilities does not imply the factorization
of $P(\mathbf{r})$, meaning that $H_1$ need not be linear in the number of units. In other words, even independent units may
produce correlated responses, and indeed strongly correlated ones, simply because every unit is driven by the same set
of stimuli. Imagine that each unit provides a very precise representation of the stimuli. If stimulus 1 is shown, the
responses of the $N$ units will show almost no trial-to-trial variability. When the stimulus is changed, another set of
$N$ responses is obtained. But the first responses always come together (driven by stimulus 1), and so do the second
ones. Even after averaging over all stimuli, this coherent behavior implies strong correlations between the responses.
In this example, $H_2 \approx 0$ and $H_1 \approx \log_2 p$.
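A toy case makes the point concrete. The sketch below assumes two noiseless binary units and four equiprobable stimuli, with made-up response tables, so that $H_2 = 0$ and the information equals the response entropy $H_1$. Each unit is a deterministic function of the stimulus, hence trivially independent given the stimulus, yet the joint entropy is not the sum of the single-unit entropies when the two units share the same tuning.

```python
import math
from collections import Counter

def entropy(outcomes):
    """Shannon entropy (bits) of the empirical distribution of equiprobable outcomes."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical noiseless responses to p = 4 equiprobable stimuli (one entry per stimulus).
unit_a = [0, 0, 1, 1]
unit_b = [0, 0, 1, 1]   # same tuning as unit_a: fully redundant
unit_c = [0, 1, 0, 1]   # complementary tuning

h_a = entropy(unit_a)
h_joint_redundant = entropy(list(zip(unit_a, unit_b)))
h_joint_complementary = entropy(list(zip(unit_a, unit_c)))
print(h_a, h_joint_redundant, h_joint_complementary)
```

The redundant pair conveys 1 bit, the same as either unit alone, while the complementary pair reaches the ceiling $\log_2 4 = 2$ bits.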

In other situations, when the number of stimuli is very large, or the representation of each one of them is noisier,
the correlations in the responses are weaker. We have seen that, in these cases, $H_1$ tends to become linear in $N$.

Throughout the work, the responses of the units were described by a vector $\mathbf{r}$. Nothing was said, however, about
what the components of the vector really are. In the experiment of Fig. 1, $r_j$ was the firing rate of neuron $j$ in a
pre-defined time window. One might however consider a slightly more complex description in which a subset of $M$
components is associated with the response of unit $j$. For example, the first $M$ principal components of its time course
0. Our analysis would still apply, replacing N units by MN components. 

In the Introduction, reference was made to the phenomenological models where the growth of $I(N)$ as given by Eq.
(3) is entirely explained by ceiling effects. In such models, the information provided by different neurons is supposed
to be independent, insofar as this is compatible with the fact that the total amount of information must be $\log_2 p$. The
models presented in this paper are not in principle opposed to the phenomenological ones; rather they are at a more
detailed level of description. Instead of a direct assumption on how different units share the available information,
we specify conditional probabilities for the responses. As a result, we find global trends that closely resemble those
of Eq. (3), that is to say, an initial linear rise and an exponential saturation at $\log_2 p$. The detailed shape of $I(N)$ is,
however, different for each model.

It should be kept in mind that whatever the detailed shape of the curve, the approach to $\log_2 p$ is no more than a
consequence of the fact that the number of stimuli is limited. The maximum information that can be extracted from
the neural responses is $\log_2 p$. It is clear that if we have a set of neurons that already provides an amount of information very near
this maximum, by adding one more neuron we will gain no more than redundant information. In other words, we
have reached a regime where the neural responses correctly distinguish the identity of each stimulus. But we cannot
deduce from this that the representational capacity of the responses remains unchanged when the number of neurons
increases. One should rather realize that the task itself is no longer appropriate to test the way additional neurons
contribute to the encoding of stimuli. In contrast, the slope of the initial linear rise is an accurate quantification of
the capacity of the system to represent items.

We have found that distributed coding schemes result in an initial slope that is roughly independent of the number
of stimuli. This means that the number of units needed to reach a given fraction of the maximum information scales as
$\log_2 p$, at least for large $p$. In contrast, when a grandmother-cell encoding is used, the initial slope is proportional to
$1/p$, and hence, one should have $N \propto p$. This makes distributed encoding much more efficient than localized schemes.
In the example of the experiment of Fig. 1, the information measure supports the conclusion, already evident from the
responses themselves, that the representation of faces in the inferior temporal cortex of the macaque is distributed.

APPENDIX A: CALCULATION OF $H_1$ USING THE REPLICA METHOD



Replacing the identity

$$\ln a = \lim_{n \to 0} \frac{1}{n} \left( a^n - 1 \right)$$

in (30) the integral in $\mathbf{r}$ can be evaluated. This we show in the present appendix.
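The identity itself is easy to check numerically. The following sketch evaluates $(a^n - 1)/n$ for decreasing $n$ at an arbitrary positive test value of $a$ (chosen here for illustration) and compares it with $\ln a$.

```python
import math

def replica_log(a, n):
    """Finite-n approximation (a**n - 1)/n to ln(a)."""
    return (a ** n - 1.0) / n

a = 3.7  # arbitrary positive test value
errors = []
for n in (1e-1, 1e-3, 1e-6):
    approx = replica_log(a, n)
    errors.append(abs(approx - math.log(a)))
    print(n, approx, math.log(a))
```

The error shrinks roughly in proportion to $n$, as expected from $a^n = 1 + n \ln a + n^2 (\ln a)^2/2 + \ldots$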



p ( ~ p ~ 71 \ 

~^ P J hi 2 n^o n I p J 



lim — 



1 ( H la 



In 2 n— >Q n \p 



where 



P r 

8=1 ^ 



£p(iy) 



'=1 



e - e n ^=i ^o')' 



and 



Sl=l S„+l=l 



n+1 



^^0') = ( 2 ^2) 1 (n + l)/2 / dT i 6XP ~ 2^ g ( 



r/) 



(Al) 



(A2) 



(A3) 



(A4) 



is a factor that depends on the j-th component of one particular way of distributing p stimuli among the n+1 replicas. 
To calculate it we observe that 



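$$\sum_{k=1}^{n+1} \left( r_j - r_j^{s_k} \right)^2 = (n+1) \left[ r_j - \frac{1}{n+1} \sum_{k=1}^{n+1} r_j^{s_k} \right]^2 + \boldsymbol{\xi}_j^{T} A\, \boldsymbol{\xi}_j, \tag{A5}$$

where $\boldsymbol{\xi}_j$ is a vector of $n+1$ components such that $\xi_j^k = r_j^{s_k}$. The vector notation is used for arrays of $n+1$ components.
The matrix $A$ has dimensions $(n+1) \times (n+1)$, and reads

$$A = I - \frac{1}{n+1}\, U, \tag{A6}$$

where $U$ is an $(n+1) \times (n+1)$ matrix, with all its coefficients equal to unity.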



Thus, the quadratic factor in (A5) can be extracted outside the integral in (A4), and

-1 



(A7) 



Replacing this expression in (A3)



,- N P P 1 N 



Si = l S„ + l = l 



(A8) 



j'=i 



We now re-arrange the summation in (A8), according to the number $d$ of different stimuli appearing in the $n+1$
replicas. For each realization of $s_1, s_2, \ldots, s_{n+1}$, the replicas can be divided into $d$ classes, such that all the replicas
belonging to the same class are associated with the same stimulus, and replicas of different classes correspond to
different stimuli. The number of replicas ascribed to stimulus $j$ is $K_j$. Clearly, the sum of all the $K_j$ is $n+1$, and
only $d$ of the $K_j$ are different from zero. Therefore,

$$\sum_{s_1=1}^{p} \sum_{s_2=1}^{p} \cdots \sum_{s_{n+1}=1}^{p} = \sum_{\{K\}} \begin{bmatrix} n+1 \\ \{K\} \end{bmatrix}, \tag{A9}$$

where the term in brackets is defined in (16), and the summation over $\{K\}$ involves all possible sets of $K_1, \ldots, K_p$
ranging from 0 to $n+1$, and whose total sum is $n+1$.
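The rearrangement can be verified by brute force for small sizes. The sketch below assumes that the bracket of (16), which is not reproduced in this appendix, is the multinomial coefficient $(n+1)!/\prod_j K_j!$. It enumerates every assignment of stimuli to replicas, groups the assignments by their occupation numbers $\{K\}$, and compares each group size with that coefficient. The sizes are hypothetical and kept small so the enumeration stays cheap.

```python
from collections import Counter
from itertools import product
from math import factorial

def multinomial(occupations):
    """Multinomial coefficient (sum K)! / prod(K_j!)."""
    total = factorial(sum(occupations))
    for k in occupations:
        total //= factorial(k)
    return total

def occupation_counts(p, replicas):
    """Count stimulus assignments (s_1, ..., s_replicas) by their occupation numbers K."""
    tally = Counter()
    for assignment in product(range(p), repeat=replicas):
        occupations = tuple(assignment.count(s) for s in range(p))
        tally[occupations] += 1
    return tally

p, replicas = 3, 4   # replicas plays the role of n + 1
tally = occupation_counts(p, replicas)
print(tally[(2, 1, 1)], multinomial((2, 1, 1)))
```

Each occupation pattern appears exactly as many times as the multinomial coefficient predicts, and the group sizes add up to $p^{n+1}$.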



The advantage of this rearrangement is that the exponent in (A8) can be written as a function of only the differences
between representations, namely

$$\boldsymbol{\xi}_j^{T} A\, \boldsymbol{\xi}_j = \frac{1}{2(n+1)} \sum_{m=1}^{p} \sum_{\ell=1}^{p} K_m K_\ell \left( r_j^m - r_j^\ell \right)^2. \tag{A10}$$

Therefore, replacing equations (A9) and (A10) in (A8) we arrive at Eq. (33).



APPENDIX B: INITIAL RISE OF $\langle I(N) \rangle$ IN THE LARGE NOISE LIMIT

Replacing @ in ([$§), we get 



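$$\langle H_1 \rangle = -\frac{1}{\ln 2} \lim_{n \to 0} \frac{1}{n} \left\{ \frac{1}{(n+1)^{N/2} (2\pi\sigma^2)^{Nn/2}\, p^{n+1}} \left[ p^{n+1} - \frac{\lambda^2 N}{4(n+1)\sigma^2}\, S \right] - 1 \right\}, \tag{B1}$$

with

$$S = \sum_{\{K\}} \begin{bmatrix} n+1 \\ \{K\} \end{bmatrix} \sum_{m=1}^{p} \sum_{\ell=1,\, \ell \neq m}^{p} K_m K_\ell. \tag{B2}$$

In order to compute $S$ we interchange the order of summation

$$S = \sum_{m=1}^{p} \sum_{\ell=1,\, \ell \neq m}^{p} \sum_{\{K\}} \begin{bmatrix} n+1 \\ \{K\} \end{bmatrix} K_m K_\ell. \tag{B3}$$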



The terms with $K_\ell$ or $K_m$ equal to zero do not contribute to $S$. Therefore, we can restrict the sum in (B3) to
$K_m \neq 0$ and $K_\ell \neq 0$. Thus, the addition over all $K$'s ranging from 0 to $n+1$ whose total sum is $n+1$ can be replaced by
another addition, where all $K$'s different from $K_\ell$ and $K_m$ range from 0 to $n-1$, $K_m$ and $K_\ell$ go from 1 to $n$, and the
sum of all the $K$'s is $n+1$. Since there are $p(p-1)$ choices for $K_\ell$ and $K_m$,

$$S = p(p-1)(n+1)\, n\, p^{n-1}. \tag{B4}$$

Replacing equation (B4) in (B1) we get

$$\langle H_1 \rangle = \frac{1}{\ln 2} \left[ \frac{N}{2} + \frac{N}{2} \ln(2\pi\sigma^2) \right] + \frac{N}{\ln 2}\, \frac{p-1}{p} \left( \frac{\lambda}{2\sigma} \right)^2. \tag{B5}$$

When $H_2$ is summed to $\langle H_1 \rangle$, equation (|3) is obtained.
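The closed form (B4) can be checked by direct enumeration for integer $n$. The sketch below assumes, as above, that the bracket in (B2) is the multinomial coefficient, so that summing over $\{K\}$ with that weight is the same as summing over all $p^{n+1}$ assignments of stimuli to replicas. Function names and test sizes are illustrative.

```python
from itertools import product

def brute_force_S(p, replicas):
    """Sum over all assignments of sum_{m != l} K_m K_l, with K the replicas per stimulus."""
    total = 0
    for assignment in product(range(p), repeat=replicas):
        occupations = [assignment.count(s) for s in range(p)]
        total += sum(occupations[m] * occupations[l]
                     for m in range(p) for l in range(p) if l != m)
    return total

def closed_form_S(p, replicas):
    """S = p (p - 1) (n + 1) n p**(n - 1), with n + 1 = replicas and n >= 1."""
    n = replicas - 1
    return p * (p - 1) * (n + 1) * n * p ** (n - 1)

for p, replicas in ((2, 3), (3, 4), (4, 3)):
    print(p, replicas, brute_force_S(p, replicas), closed_form_S(p, replicas))
```

The agreement reflects a simple counting argument: $\sum_{m \neq \ell} K_m K_\ell$ counts the ordered pairs of replicas that carry different stimuli, and there are $(n+1)n$ such pairs, each realized in $p(p-1)p^{n-1}$ assignments.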



APPENDIX C: THE SMALL $\sigma$ LIMIT



We go back to Eq. (§|). We re- write (||) as 

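$$\langle A_{\{K\}} \rangle = \int \prod_{s=1}^{p} dr^s\, \rho(r^s)\, \exp\left[ -\frac{1}{2(n+1)\sigma^2}\, \mathbf{x}^{T} M\, \mathbf{x} \right], \tag{C1}$$

where $\mathbf{x}$ is a vector of $p$ components, such that $x_s = r^s$, and

$$M = (n+1) \begin{pmatrix} K_1 & & & \\ & K_2 & & \\ & & \ddots & \\ & & & K_p \end{pmatrix} - \begin{pmatrix} K_1 K_1 & K_1 K_2 & \cdots & K_1 K_p \\ K_2 K_1 & K_2 K_2 & \cdots & K_2 K_p \\ \vdots & \vdots & & \vdots \\ K_p K_1 & K_p K_2 & \cdots & K_p K_p \end{pmatrix}. \tag{C2}$$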

The integrand in (C1) is 1 at the origin, and also along the eigenvectors of $M$ corresponding to a zero eigenvalue.
The number of such eigenvalues is equal to or larger than the number of $K$ that are zero. We therefore re-arrange the
numbering of the representations in such a way as to put all those with $K$ different from zero in the first $d$ places.
Thus, $K_{d+1} = K_{d+2} = \ldots = K_p = 0$. With this ordering, matrix $M$ is filled with zeros in all those positions with a
row or a column greater than $d$. Integrating in $r^{d+1}, r^{d+2}, \ldots, r^p$ we get

$$\langle A_{\{K\}} \rangle = \int \prod_{s=1}^{d} dr^s\, \rho(r^s)\, \exp\left[ -\frac{1}{2(n+1)\sigma^2}\, \mathbf{x}'^{T} M'\, \mathbf{x}' \right], \tag{C3}$$



where $\mathbf{x}'$ and $M'$ are defined as $\mathbf{x}$ and $M$, but live in a space of $d$ dimensions (and not $p$).

In order to integrate Eq. (C3) we observe that $M'$ has a single eigenvalue $\lambda_1$ equal to zero, with eigenvector

$$\mathbf{w}_1 = \frac{1}{\sqrt{d}}\, (1, 1, \ldots, 1)^{T}. \tag{C4}$$



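We call $\mathbf{w}_2, \ldots, \mathbf{w}_d$ all the other eigenvectors, corresponding to the nonvanishing eigenvalues $\lambda_2, \ldots, \lambda_d$. We choose the
eigenvectors normalized, and orthogonal to each other and to $\mathbf{w}_1$ (the symmetry of $M$ allows us to do so). With this
set of vectors we construct a new basis, and call $w$ the collection of coordinates in this new system. We define a
matrix $C$ as the change of basis

$$\mathbf{x}' = C\, w, \tag{C5}$$

where

$$C = \begin{pmatrix} 1/\sqrt{d} & C_{12} & \cdots & C_{1d} \\ 1/\sqrt{d} & C_{22} & \cdots & C_{2d} \\ \vdots & \vdots & & \vdots \\ 1/\sqrt{d} & C_{d2} & \cdots & C_{dd} \end{pmatrix} \tag{C6}$$

and $\det(C) = 1$. In this new basis,

$$\langle A_{\{K\}} \rangle = \int \prod_{j=1}^{d} dw_j\, \rho\!\left( \frac{w_1}{\sqrt{d}} + C_{j2} w_2 + \ldots + C_{jd} w_d \right) \exp\left[ -\frac{1}{2} \sum_{\ell=2}^{d} \frac{\lambda_\ell}{\sigma^2 (n+1)}\, w_\ell^2 \right]. \tag{C7}$$

Multiplying and dividing by the product of all $\sqrt{2\pi(n+1)\sigma^2/\lambda_\ell}$, for $\ell \in [2, d]$, we get

$$\langle A_{\{K\}} \rangle = \prod_{\ell=2}^{d} \sqrt{\frac{2\pi(n+1)\sigma^2}{\lambda_\ell}} \times \int \prod_{j=1}^{d} dw_j\, \rho\!\left( \frac{w_1}{\sqrt{d}} + C_{j2} w_2 + \ldots + C_{jd} w_d \right) \prod_{k=2}^{d} \sqrt{\frac{\lambda_k}{2\pi(n+1)\sigma^2}}\, \exp\left[ -\frac{1}{2} \frac{\lambda_k}{\sigma^2 (n+1)}\, w_k^2 \right]. \tag{C8}$$

In the limit $\sigma \to 0$, the integrand in (C8) includes $d - 1$ delta functions. Once integrated,

$$\lim_{\sigma \to 0} \langle A_{\{K\}} \rangle = \prod_{j=2}^{d} \sqrt{\frac{2\pi\sigma^2 (n+1)}{\lambda_j}} \int dw_1 \left[ \rho\!\left( \frac{w_1}{\sqrt{d}} \right) \right]^{d}. \tag{C9}$$

It may be shown that

$$\prod_{j=2}^{d} \lambda_j = d\, (n+1)^{d-2} \prod_{\ell=1}^{d} K_\ell. \tag{C10}$$

Thus,

$$\lim_{\sigma \to 0} \langle A_{\{K\}} \rangle = (2\pi\sigma^2)^{(d-1)/2}\, \frac{(n+1)^{1/2}}{\sqrt{\prod_{\ell=1}^{d} K_\ell}}\, B_d, \tag{C11}$$

where

$$B_d = \int dx\, [\rho(x)]^{d}. \tag{C12}$$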
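The eigenvalue product quoted in (C10) can be checked with exact integer arithmetic. The sketch below assumes the reduced matrix has the form $M' = (n+1)\,\mathrm{diag}(K) - K K^{T}$ with all $K_\ell > 0$ and $\sum_\ell K_\ell = n+1$. It uses the standard fact that, for a symmetric matrix with exactly one zero eigenvalue, the product of the remaining eigenvalues equals the sum of the principal minors of order $d-1$. The occupation numbers are arbitrary test values.

```python
def determinant(matrix):
    """Determinant by Laplace expansion along the first row (fine for small integer matrices)."""
    size = len(matrix)
    if size == 0:
        return 1
    if size == 1:
        return matrix[0][0]
    total = 0
    for col in range(size):
        minor = [row[:col] + row[col + 1:] for row in matrix[1:]]
        total += (-1) ** col * matrix[0][col] * determinant(minor)
    return total

def reduced_matrix(occupations):
    """M' = (n+1) diag(K) - K K^T for the non-zero occupation numbers K."""
    total = sum(occupations)
    d = len(occupations)
    return [[(total * occupations[i] if i == j else 0) - occupations[i] * occupations[j]
             for j in range(d)] for i in range(d)]

def product_of_nonzero_eigenvalues(matrix):
    """Sum of the principal minors of order d-1 (valid when exactly one eigenvalue is zero)."""
    d = len(matrix)
    total = 0
    for skip in range(d):
        keep = [i for i in range(d) if i != skip]
        total += determinant([[matrix[i][j] for j in keep] for i in keep])
    return total

def predicted_product(occupations):
    """Right-hand side d (n+1)**(d-2) prod(K), for d >= 2."""
    d = len(occupations)
    prod_k = 1
    for k in occupations:
        prod_k *= k
    return d * sum(occupations) ** (d - 2) * prod_k

for occupations in ([2, 2], [1, 2, 3], [1, 1, 1, 2]):
    m = reduced_matrix(occupations)
    print(occupations, determinant(m), product_of_nonzero_eigenvalues(m), predicted_product(occupations))
```

In each case the full determinant vanishes, confirming the zero eigenvalue, and the minor sum matches the predicted product.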

We now turn to the calculation of



{K} 



N 



(C13) 



where, as before, the summation runs over all sets of $\{K\}$ that add up to $n+1$. Equation (C11) states that $\langle A_{\{K\}} \rangle$
depends on $d$, that is, on the number of $K$ that are different from zero. Therefore, we write the sum in (C13) as



n+l 



* = E 



where 



1 p! 



*=E 

{K}> 



(2 7r a 2 )( d - 1 )/ 2 V^+TB d 5 1; 



iV 



n+l 



-, N/2 



The sum in S\ involves only the d values of K that are different from zero. We now make the approximation 



But 



And, taking the limit 



E 

{K}> 



Si 



n + l 
{K} 



, \ Nd/2 

d \ 1 ^ I n + l 



n + l 



E 

{K}< 



{K} 



= E(-!) j 



d\ 



d\(d-j) 



l^o E [{K} )=^i+-E 

{K}> 



d\ 



3=0 



{d-m 



{d-j) n+ \ 



(C14) 



(C15) 



(C16) 



(C17) 



(C18) 



Moreover, if $N$ is large, as $d$ grows $(B_d)^N \to 0$. Therefore, keeping just $d = 1$ and $d = 2$ we may approximate

S m p + p(p - 1) In 2 (2w 2 ) w/2 (n + l)^ 2 (Ba)* »• (C19) 
Replacing in Eqs. (38) and (39) we arrive at Eq. (50).



APPENDIX D: INFORMATION BETWEEN THE ACTUAL RESPONSE AND THE STORED 

REPRESENTATION 



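The aim is to calculate (52) under the assumption (|2q). Replacing (28) in (54) the probability $P(\mathbf{r})$ can be written
as

$$P(\mathbf{r}) = \prod_{j=1}^{N} \zeta(r_j), \tag{D1}$$

where

$$\zeta(r_j) = \int dr_j^0\, \rho(r_j^0)\, \frac{e^{-(r_j - r_j^0)^2 / 2\sigma^2}}{\sqrt{2\pi\sigma^2}}. \tag{D2}$$

Just as before, we separate

$$I = H_1 - H_2, \tag{D3}$$

where

$$H_2 = -\int d\mathbf{r}^0 \int d\mathbf{r}\; P(\mathbf{r}|\mathbf{r}^0)\, P(\mathbf{r}^0)\, \log_2 [P(\mathbf{r}|\mathbf{r}^0)] = \frac{N}{2 \ln 2} \left[ 1 + \ln(2\pi\sigma^2) \right],$$

$$H_1 = -\int d\mathbf{r}^0 \int d\mathbf{r}\; P(\mathbf{r}|\mathbf{r}^0)\, P(\mathbf{r}^0)\, \log_2 [P(\mathbf{r})] = -N \int dt\; \zeta(t)\, \log_2 [\zeta(t)]. \tag{D4}$$

Inserting the definition (D2) of $\zeta(t)$, and using the expression (A1) for the logarithm we get

$$H_1 = -\frac{N}{\ln 2} \lim_{n \to 0} \frac{1}{n} \left\{ \int dt \left[ \int dx\, \rho(x)\, \frac{e^{-(t - x)^2 / 2\sigma^2}}{\sqrt{2\pi\sigma^2}} \right]^{n+1} \right.$$
$$\left. \qquad\qquad - \int dx\, \rho(x) \int dt\, \frac{e^{-(x - t)^2 / 2\sigma^2}}{\sqrt{2\pi\sigma^2}} \right\}. \tag{D5}$$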



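The last term in (D5) is nothing but the integral of $\zeta(x)$ over all $x$, which can be shown to give 1. To carry out the
integral in $t$ in the first line of (D5) we observe that

$$\prod_{j=1}^{n+1} \frac{e^{-(t - x_j)^2 / 2\sigma^2}}{\sqrt{2\pi\sigma^2}} = (2\pi\sigma^2)^{-(n+1)/2} \times \exp\left\{ -\frac{(n+1)}{2\sigma^2} \left[ \left( t - \frac{1}{n+1} \sum_{j=1}^{n+1} x_j \right)^2 + \frac{1}{n+1} \sum_{j=1}^{n+1} x_j^2 - \left( \frac{1}{n+1} \sum_{k=1}^{n+1} x_k \right)^2 \right] \right\}.$$

When replacing this expression in (D5), the integration in $t$ can be done right away. The result is $[2\pi\sigma^2/(n+1)]^{1/2}$.
Therefore,

$$H_1 = -\frac{N}{\ln 2} \lim_{n \to 0} \frac{1}{n} \left\{ (2\pi\sigma^2)^{-n/2} (n+1)^{-1/2} \int \prod_{\ell=1}^{n+1} dx_\ell\, \rho(x_\ell)\, \exp\left[ -\frac{1}{2\sigma^2} \left( \sum_{j=1}^{n+1} x_j^2 - \frac{1}{n+1} \left( \sum_{k=1}^{n+1} x_k \right)^2 \right) \right] - 1 \right\}. \tag{D6}$$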



In the same way as in Eq. (A10), we write

$$\sum_{j=1}^{n+1} x_j^2 - \frac{1}{n+1} \left( \sum_{j=1}^{n+1} x_j \right)^2 = \frac{1}{2(n+1)} \sum_{\ell=1}^{n+1} \sum_{m=1}^{n+1} (x_\ell - x_m)^2. \tag{D7}$$



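Thus, replacing Eq. (D7) in (D6), and making the expansion

$$\exp\left[ -\frac{1}{4\sigma^2 (n+1)} \sum_{\ell=1}^{n+1} \sum_{m=1}^{n+1} (x_\ell - x_m)^2 \right] \simeq 1 - \frac{1}{4\sigma^2 (n+1)} \sum_{\ell=1}^{n+1} \sum_{m=1}^{n+1} (x_\ell - x_m)^2, \tag{D8}$$

$H_1$ can be calculated. The result is

$$H_1 = \frac{N}{\ln 2} \left\{ \frac{1}{2} \left[ 1 + \ln(2\pi\sigma^2) \right] + \frac{\lambda^2}{4\sigma^2} \right\}. \tag{D9}$$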



When $H_2$ is subtracted, Eq. (55) is obtained. It should be noticed that Eq. (D8) is not an approximation. The
$j$-th order in the Taylor expansion of the exponential grows as $[n(n+1)]^j$. Therefore, only the linear term gives a
contribution for $n \to 0$.